Add SWE-ZERO 12M dataset by neubig · Pull Request #258 · neulab/agent-data-protocol

neubig · 2026-06-03T14:35:39Z

Summary

Closes #257.

Adds the AlienKevin/SWE-ZERO-12M-trajectories dataset to ADP following the existing mini-swe-agent dataset patterns.

Dataset source

Source: https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories
License: Apache-2.0
Split used: train
Size: 12,290,800 rollouts over 122,908 unique PRs, 3,222 repositories, 16 programming languages, and approximately 112B tokens according to the dataset card.

Files added

datasets/AlienKevin_SWE-ZERO-12M-trajectories/README.md
datasets/AlienKevin_SWE-ZERO-12M-trajectories/extract_raw.py
datasets/AlienKevin_SWE-ZERO-12M-trajectories/schema_raw.py
datasets/AlienKevin_SWE-ZERO-12M-trajectories/raw_to_standardized.py
datasets/AlienKevin_SWE-ZERO-12M-trajectories/metadata.json
datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_raw.json
datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_std.json
datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_sft/openhands_v0.json

Schema mapping summary

Skip raw system messages because they define mini-swe-agent formatting and execution-free constraints.
Convert initial raw user task messages to TextObservation(source="user").
Convert raw user messages beginning with Observation: to TextObservation(source="environment") with the prefix stripped.
Convert raw assistant messages with fenced bash blocks to CodeAction(language="bash"), preserving pre-command reasoning as the action description.
Preserve assistant messages without bash blocks as MessageAction entries.
Store instance_id, repo, trajectory_format, exit_status, and duration_sec in trajectory details.
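
The mapping rules above can be sketched roughly as follows. This is a minimal illustration, not the code in raw_to_standardized.py: the dataclasses are simplified stand-ins for the ADP schema classes, and the function name convert_message, the regular expressions, and the assumption that each raw message is a dict with role and content keys are all hypothetical.

```python
import re
from dataclasses import dataclass

# Simplified stand-ins for the ADP event types (assumption: the real
# schema classes carry more fields than shown here).
@dataclass
class TextObservation:
    content: str
    source: str  # "user" or "environment"

@dataclass
class CodeAction:
    language: str
    content: str
    description: str

@dataclass
class MessageAction:
    content: str

OBS_PREFIX = "Observation:"
FENCE = "`" * 3  # built indirectly so this example nests cleanly in a fence
# First fenced bash block in an assistant message (assumed fence style).
BASH_BLOCK = re.compile(FENCE + r"bash\s*\n(.*?)" + FENCE, re.DOTALL)
THOUGHT_LABEL = re.compile(r"^\s*THOUGHT:\s*")

def convert_message(msg: dict):
    """Map one raw chat message to a standardized event, or None to skip it."""
    role, text = msg["role"], msg["content"]
    if role == "system":
        # System prompts only describe formatting constraints; skip them.
        return None
    if role == "user":
        if text.startswith(OBS_PREFIX):
            # Strip the prefix and at most one following space, nothing more.
            body = text[len(OBS_PREFIX):]
            body = body[1:] if body.startswith(" ") else body
            return TextObservation(body, "environment")
        return TextObservation(text, "user")
    if role == "assistant":
        match = BASH_BLOCK.search(text)
        if match is None:
            # No bash block: keep the turn as a plain message.
            return MessageAction(text)
        # Text before the command becomes the description, minus the label.
        reasoning = THOUGHT_LABEL.sub("", text[:match.start()]).strip()
        return CodeAction("bash", match.group(1).strip(), reasoning)
    raise ValueError(f"unexpected role: {role!r}")
```

Under these assumptions, a system message yields None, a user message starting with Observation: yields an environment observation, and an assistant turn without a bash block falls through to MessageAction rather than being dropped.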

Design decisions

Ambiguity: The source dataset has 100 independent rollouts per PR and repeats instance_id across rows.
- Chosen approach: Derive ADP IDs from instance_id plus a deterministic SHA-1 content hash.
- Example: rsteube__carapace-849 becomes IDs such as rsteube__carapace-849-f3b732c7f08f.
- Alternatives rejected: Using only instance_id would create duplicate sample IDs; adding a synthetic counter would depend on extraction position and be less stable.
Ambiguity: The dataset card says most trajectories are incomplete and explicitly frames the corpus as mid-training data rather than verified SFT data.
- Chosen approach: Preserve all non-empty trajectories rather than keeping only those whose exit_status is Submitted.
- Example: Sample trajectories with exit_status: incomplete are standardized and converted to OpenHands v0 SFT.
- Alternatives rejected: Filtering to successful submissions would discard the bulk of the dataset and conflict with the dataset card's intended use.
Ambiguity: Raw observations are encoded as user messages prefixed with Observation:.
- Chosen approach: Treat these as environment observations and strip only the prefix.
- Example: Observation: ./example/cmd/_test/xonsh.py becomes an environment TextObservation containing ./example/cmd/_test/xonsh.py.
- Alternatives rejected: Leaving observations as user messages would lose tool-result structure; stripping more text could remove meaningful command output.
Ambiguity: Some assistant turns may not contain a valid bash block even though the prompt requests one.
- Chosen approach: Convert assistant turns without a bash block to MessageAction so malformed or terminal natural-language turns are preserved.
- Example: A plain assistant explanation remains a message action rather than being dropped.
- Alternatives rejected: Dropping these turns would alter trajectory semantics; inventing a command would introduce unsupported behavior.
Ambiguity: Assistant messages may contain reasoning before a bash command.
- Chosen approach: Preserve reasoning as CodeAction.description after removing a leading THOUGHT: label.
- Example: THOUGHT: I need to inspect files... becomes the code action description.
- Alternatives rejected: Keeping THOUGHT: in descriptions adds format noise; discarding the reasoning loses useful supervision.
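
The ID scheme from the first design decision could look something like the sketch below. The PR text only states that IDs combine instance_id with a deterministic SHA-1 content hash. Which fields are hashed, the canonical JSON serialization, the 12-character truncation (inferred from the example ID), and the name make_sample_id are assumptions made for illustration.

```python
import hashlib
import json

def make_sample_id(instance_id: str, messages: list, digest_len: int = 12) -> str:
    """Append a short SHA-1 content digest to instance_id.

    Assumption: the hash covers the rollout's message list; the actual
    extractor may hash a different subset of the row.
    """
    # Canonical JSON (sorted keys, fixed separators) so identical content
    # always serializes to identical bytes, independent of dict key order.
    payload = json.dumps(
        messages, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    return f"{instance_id}-{digest[:digest_len]}"
```

Because the suffix depends only on content, re-running extraction in a different order gives the same IDs, which is the stability property the rejected synthetic counter lacks. Two rollouts with byte-identical content would still collide, so a real pipeline may need a separate check for exact duplicates.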

Known limitations

The source trajectories are execution-free and are not verified against tests.
Many trajectories are incomplete or truncated at the source dataset's 15-turn cap.
Samples are intentionally small and generated from the beginning of the training stream.

Tests run

python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q
PATH=/home/openhands/.local/bin:$PATH python -m pytest tests/ -q

This PR was created by an AI agent (OpenHands) on behalf of the user.


Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

🟡 Acceptable overall — pipeline is clean, CI passes, and the schema mapping is well-documented. Two issues need to be addressed before merge.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW — adds a new dataset directory only; no shared schema or converter changes.

Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26894417354

This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

Add SWE-ZERO 12M dataset

1862a2b

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai Bot mentioned this pull request Jun 3, 2026

Add 12M swe-zero dataset #257

Open

neubig marked this pull request as ready for review June 3, 2026 15:17

github-actions Bot requested changes Jun 3, 2026


Comment thread datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_raw.json

Comment thread datasets/AlienKevin_SWE-ZERO-12M-trajectories/raw_to_standardized.py Outdated

openhands-agent added 3 commits June 13, 2026 23:27

Merge remote-tracking branch 'origin/main' into pr-258

de290df

Migrate SWE-ZERO dataset to ATIF pipeline

ba63e1e

Address SWE-ZERO sample review feedback

93fe9c2

neubig wants to merge 4 commits into main from openhands/add-swe-zero-12m
